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Abstract 



Semiparametric accelerated failure time (AFT) models directly relate the predicted failure 
times to covariates and are a useful alternative to models that work on the hazard function or 
the survival function. For case-cohort data, much less development has been done with AFT 
models. In addition to the missing covariates outside of the sub-cohort in controls, chal- 
lenges from AFT model inferences with full cohort are retained. The regression parameter 
estimator is hard to compute because the most widely used rank-based estimating equations 
are not smooth. Further, its variance depends on the unspecified error distribution, and 
most methods rely on computationally intensive bootstrap to estimate it. We propose fast 
rank-based inference procedures for AFT models, applying recent methodological advances 
to the context of case-cohort data. Parameters are estimated with an induced smoothing 
approach that smooths the estimating functions and facilitates the numerical solution. Vari- 
ance estimators are obtained through efficient resampling methods for nonsmooth estimating 
functions that avoids full blown bootstrap. Simulation studies suggest that the recommended 
procedure provides fast and valid inferences among several competing procedures. Appli- 
cation to a tumor study demonstrates the utility of the proposed method in routine data 
analysis. 

Keywords: induced smoothing; multiplier bootstrap; resampling 



1 Introduction 



A case-cohort design ( jPrenticd . Il986l ) is an effective and economical design which reduces 
the effort and cost of a full-scale cohort study. Such design originated to allow efficient 
analysis of studies where it is too expensive and time consuming to collect and analyze data 
on all subjects. Cases and controls refer to subjects who have and have not, respectively, 
developed the disease of interest by the end of the study period. A case-cohort design is 
typically composed of two steps. First, a subset called sub-cohort is randomly selected from 
the whole cohort regardless of their disease status. Second, the remaining cases in the cohort 
are added to the sub-cohort. Cases and controls refer to subjects who have and have not, 
respectively, developed the disease of interest by the end of the study period. Measurement 
on the main risk factors are taken only on subjects in the sub-cohort and the remaining 
cases outside of the sub-cohort. This leads to substantial reduction in the effort and cost of 
conducting large scale cohort studies, especially when the disease of interest is rare or the 
main risk factors are expensive to measure. 

A semiparametric accelerated failure time (AFT) model is a log-linear model for the fail- 
ure times with unspecified error distribution. It directly relates the failure time to covariates 
such that the effect of a covariate is to multiply the predicted failure time by a constant. For 
failure time data from case-cohort studies, most statistical m ethods have f o cused on semi- 



parametric models th a t work on either the hazard function (iBarlowl. Il994j; iKang and Cai 



2009; K ulich and Lid. 120001: iLin and Yind . 119931 : iPrenticd . 119861: ISelf and Prentice 



1988 



Sun et al. L 2004 L Therneau and Li , 1999 ) , or the survival function ( Chen , 2001aUbl: Kong et al 



20041: ILu and Tsiatisl . l2006 l). Parametric AFT models were considered by lKalbfleisch and Lawless] 



( 119881 ). Inferences about semiparametric AFT models fo r case- c ohort data are much l e ss de- 



veloped, with only a few recent works (IKong and Cail . 120091 ; iNan et all 120061 : lYul . 12011 



Yu et all l2007h . 



Inferences for semiparametric AFT models have been difficult for not only case-cohort 
data but also for complete data. The most importan t estimator is t he rank-based estimator 
motivated from inverting t he weighted l o g-ran k test ( jPrenticd . Il978[ ). with asymptotic prop- 
erties rigorously studied (ITsiatisl . Il990t lYingl . Il993[ ). Nevertheless, the estimator has not 
been as widely used as it should be due to lack of efficient and reliable computing algorithm 
to obtain both parameter estimates and their standard errors. 

The parameter estimates are hard to compute because the most widely used rank-based 
estimating equations are not smooth. Recent works shed light on bringing AF T models 
into routine data analysis practice, including case-cohort studies. |jin et al.l (120031 ) exploited 
that the rank-based estimating equation with Gehan's weight is the gradient of an objective 
function and obtained estimates by solving it with lin ear programming. This approach was 
adapted to case-cohort data by IKong and Cail (120091 ). Nevertheless, the optimization with 
linear programming is still computationally very demanding, especially for larger sample 
sizes. A mor e computing efficient app roach for rank-based inference is the induced smoothing 
procedure of lBrown and Wa ng (20 07|). This ap proach is an application of the general induced 
smoothing method of lBrown and Wangl (120051 ). where the discontinuous estimating equations 
are replaced with a smoothed version, whose solutions are asymptotically equivalent to those 
of the former. The smoothed estimating equations are differentiable, thus facilitates rapid 
numerical solution. 
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Direct estimation of the variance is difficult because it involves nonparametric estimation 
of the unspecified err or distribution. Most existing methods rely on bootstrap which is very 
computing intensive. iJin et al.l ( 120031 ) estimated the variance through a multiplier resampling 
method, which requires a large bootstrapping samp le in order to obtain a reliable variance 
estimate. For case-cohort data. iKong and Cail (120091 ) adopted a specially designed bootstrap 
procedure (IWacholder et al.l . ll989l ). The demanding computing task in linear program ming is 
ampli fied because it requires solving estimating equations for each bootstrap sample. [Huang 
( 120021 ) proposed an easy-to-compute variance estimator based on the asymptotic linearity 
property of the estimating equations. A decomposition matrix of the variance matrix is 
estimated by solving estimating equations, but the number of the estimating equations to 
solve is much smaller; it is just the dimen sion of the parameters. For general nonsmooth 
estimating functions, IZeng and Linl (120081 1 proposed a resampling strategy that does not 
require solving estimating equations or minimizing objective functions. Instead, it only 
involves evaluations of estimating functions and simple linear regression in estimating the 
slope matrix. The resulting variance estimators are computationally more efficient and stable 
than those from existing resampling methods. 

In this article, we propose a fast rank-based inference procedure for semiparametric AFT 
models in the context of case-cohort studies. The parameters are estimated with an induced 
smoothing approach. Variance estimators are obtained through an efficient resampling meth- 
ods for nonsmooth estimating functions that avoids full blown bootstrap. Of course, the 
methods also apply to full cohort data. 

The rest of this article is organized as follows. Point estimation procedures based on 
smoothed estimating equations for case-cohort data when the sub-cohort is a simple random 
sample from the full cohort are proposed in Section [2J Four variance estimation procedures, 
one based on full multiplier bootstrap and three based on possibly multiplier bootstrap- 
aided sandwich variance estimator, are proposed in Section [3j A large scale simulation study 
is reported in Section HI comparing the performances of the variance estimator and their 
timings. The methods are applied to a tumor study with both case-cohort data and full 
cohort data in Section A discussion concludes in Section [HI 



2 Point Estimation 

Let {Tj, Cj, Xi}, i — 1, . . . , n, be n independent copies of {T, C, X}, where Tj and Cj are log- 
transformed failure time and log-transformed censoring time, Xi is a p x 1 covariate vector, 
and given X, C and T are assumed to be independent. A semiparametric AFT model has 
the form 

Ti = X?/3 + €i, i = l,...,n, 

where /3 is an unknown p x 1 vector of regression parameters, e«'s are independent and 
identically distributed random variables with an unspecified distribution. It is also assumed 
that e^s are independent of Xi. 

In a full cohort study, due to censoring, the observed data are (Yi, A i; Xi), i = 1, . . . , n, 
where Y$ = min(Tj,C , j), A, = J[T, < Cj], and /[•] is the indicator function. A rank based 
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estimating equation with Gehan's weight is 

n n 

U n(P) = E E A ^ - Xj) 1 ^) * em = 0, (1) 

i=l j=l 

where ej(/3) = — Xj0 . The r oot o f (p]) is consistent to the true parameter /3 , and 
is asymptotically normal (Tsiatis, 1990l ). Despite these nice pro perties, even fo r the most 
promising method to date that solves it via linear programming (IJin et all 120031 ). the com- 
puting burden increases drastically when bootstrapping is used to estimate the variance of 
the estimator. 

For a case-cohort study, the covariate vector X^s are not completely available for each 
individual. Measurement of some covariates is taken only on the subjects in the sub-cohort 
and cases outside the sub-cohort, and, thus, estimating function (TTJ cannot be evaluated. 
Using the observed data naively in p]) would lead to misleading results because the case- 
cohort sample is biased — it includes all cases but only a fraction of controls. It is possible, 
however, to adjust the biases by incorporating a weight that depends on the selection scheme 
of case-cohort samples. Suppose we select a sub-cohort of size n by simple random sampling 
without replacement from the whole cohort. Let £j be the sub-cohort indicator; £j = 1 if 
the ith observation is in the sub-cohort and = otherwise. Let p = lim^ooPn, where 
Pn = Ti/n is the sub-cohort inclusion probability. Under these assumptions, the desired 
case-cohort weight is hi — Aj + (1 — Aj)£j/p n . The weight-adjusted estimating equation (TTJ) 
becomes 



KiP) = E E ~ WW) > eiCS)] = 0. 



(2) 



i=i j=i 



The solution to (f5]), n , remains to be consistent and asymptotically normal (IKong and Cai . 



2009|) 



For full cohort data, a computationally more efficient ap proach for rank-based in ference 
with Gehan's weight is the induced smoothing procedure of iBrown and Wang! (120071 ). Such 
smoothing method leads to continuously different iable estimating equations that can be 
solved with standard numerical methods. Let Z be a p-dimensional standard normal random 
vector. The estimating function U n (f3) in ([TJ is replaced with E[U n {j3 + n -1 / 2 Z)], where the 
expectation is taken with respect to Z. This lead to 



i=l j=l 



0. 



(3) 



where rf- 



Xj) and $(•) denotes the standard normal cumulative 



n-\X t - X J ) T (X i 

distribution function. The solutio n to (j3)) is consistent to 0n a nd ha s the same asymptotic 
distribution as the solution to (CQ) ( 1 Johnson and Strawdermanl . 120091 ). 

For case- cohor t data , we propose a smoothed version of (J2J) by adapting the idea of 
Brown and Wana (120071 ). Specifically, we replace U°(f3) with E[U£(P + n~ x l 2 Z)\ to obtain 
the induced smooth version of O), 



= 1 3=1 



i.i 



(4) 
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The solution j3 n to (jlj) is a consistent estimator to (3q and is asymptotically normal. Further- 
more, the asymptotic distributio n of (3 n is also the same as that of (3 n . These arguments can 
be justified similarly as those in iJohnson and Strawdermanl (120091 ). 



3 Variance Estimation 

The asymptotic variance of (3 n is even harder to estimate for case-cohort data than for full 
cohort data because of the extra complexity caused by the data structure. The terms in the 
summation in U n (/3) are not independent since the sub-cohort is drawn from the full cohort 
without replacement. We propose four variance estimators; one is fully resampling based 
while the other three use resampling to a component of the sandwich variance estimator. 

3.1 Multiplier Bootstrap 



The multiplier bootstrap estimator of |Jin et al.l (120031 ) is adapted to case-cohort data by 
inserting proper case-cohort weights, hiS, in the multiplier bootstrap estimating equations. 
Let rji, i = l,...,n, be independent and identically distributed positive random variables 
with E(r/j) = Vd,r(i]i) = 1. Define 



=1 3=1 



e,(/3)-e 4 (/3) 



i.i 



(5) 



For a realization of (771, ... , r] n ), the solution to (JSJ) provides one draw of (3 n from its asymp- 
totic distribution. By repeating this process a large number B times, the variance matrix of 
j3 n can be estimated directly by the sampling variance matrix of the bootstrap sample of {3 n . 

Since the asymptotic variance of j3 n is the same as that of j3 n , the covariance matrix 
of f3 n can also be estimated by ([2} through multiplier bootstrap. This is, however, not 
recommended because it would need to solve a large number B nonsmooth estimating equa- 
tions. As will be seen in our simulation study, even with the computationally more efficient 
smoothing estimating equations, the multiplier bootstrap approach can still be very time 
consuming, especially for larger sample sizes or more covariates. 



3.2 Sandwich Estimator 

To improve the computational efficiency, we consider alternative variance estimation proce- 
dures based on the sandwich form that avoid solving estimating equations repetitively. The 
asymptotic variances of n and n are the same, both having a sandwich form. Under some 
regularity conditions (IZeng and Linl . 120081 ). uniformly in a neighborhood of (3 , equation (j2J) 
can be expressed as 



rT l ' 2 U c M = n- 1 ' 2 hiSi(p ) + An 1 ' 2 ^ - fa) + o p {l + n l ' 2 \\fi - /3 ||), 



i=l 



where Si (A)) is a zero-mean random vector, and A is asymptotic slope matrix of n l / 2 U^{f3o). 
The analytical details of Sj(/3 ) for case-cohort data is presented in the Appendix. The 
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asymptotic variance matrix of \/n(j3 n — j3 ) is nS = nA~ 1 V(A~ 1 ) T , where V is the variance 
of n~ 1//2 X^r=i KSiiPo)- Estimation of £ involves estimating V and A by estimator V n and 
A n , respectively. The variance estimator then has the sandwich form £„ = A^V^A^ 1 ) 1 . 

3.2.1 Estimation of V 

Matrix V can be estimated either through a closed-form estimator or through bootstrapping 
the estimating equations. For case-cohort data, due to the correlated feature of &'s in /ij's, V 
is different from its full cohort counterpart. There are two sources of variations contributing 
to V: variation due to the sampling of a full cohort (Vi) and variation due to the sampling 
of a sub-cohort within the full cohort (V2). m particular, we have 

V = V, + = E [S^S^Y] + i^Var [(1 - A t )S t ((3 )] , 

p L J V 

where V2 vanishes if full cohort data are available. 

Closed-form With explicit expressions for Si((3)'s in the Appendix, a closed-form estima- 
tor of V is 

Ki = Vln H Vzn 

Pn 

where 

n 

i=i 

and 

n ( n If" 

V 2n = n~ l h(l-^)Si(P n )S70 n )-\ n- 1 Ml " &i)Si($ n ) \\ n 'Yl h ^ - ^)&Wn 

i=l L i=l J I i=l 

and Si((3 n ) is obtained by replacing unknown quantities in Si(/3) with their sample counter- 
parts. 

Multiplier Bootstrap When Si(j3 n ) have complicated e xpressions, it is more convenient 



and perhaps more accurate to estimate V via bootstrap (jZeng and Linl . 120081 ). Because 
and have the same asymptotic distribution, we apply the multiplier bootstrap ap- 
proach to U^. Evaluation of (p]) at f3 n with each realization of (771, . . . , r) n ) provides one 
bootstrap replicate of U°*(f3 n ). With B replicates, we estimate V by the sample variance 
of the bootstrap sample of U°*(f3 n ). The bootstrap here is much less demanding than the 
full multiplier bootstrap above, because it only involves evaluations of estimating equations 
instead of solving them to obtain each bootstrap replicate. 

3.2.2 Estimation of A 

With V estimated by V n , we next propose three approaches to estimate the slope matrix 
A. Depending whether V n is based on closed-form or multiplier bootstrap, we will have two 
versions of estimator of £ for each approach of slope matrix estimation. 
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Induced Smoothing With U°, the smoothed version of U°, the slope matrix A can be 
estimated directly by 

n op 1 

The close-form expression of A n can be evaluated easily. The variance estimator then has 
the sandwich form E n = A^V^A^Y . 



Smoothed Huang's (2002) Approach iHuand (120021 ) avoided the difficulty in estimating 



the slope matrix of nonsmooth estimating equations by exploiting the asymptotic linearity of 
the estimating equations. Nevertheless, this approach still requires solving p nonsmooth esti- 
mating equations, whose convergence may be a problem. We adapt Huang's approach by re- 
placing the p nonsmooth estimating equations with their smoothed versions. Let V n = L^L n 
be the Cholesky decomposition of V n . Let q n j be the solution to the following estimating 
equations for 7, j — 1, . . . ,p, 

where lj is the jth column of L n . The solutions can be obtained with from gen eral purpose 
nonli near equat ion solvers; in our imp l ement ation we used R packages nleqslv (jHasselmanl . 
20121 ) and BB (IVaradhan and Gilbertl . 120091 ). Let Q n be the matrix whose jth column is 
Qnj — fin- Then Q^Qn is an estimate of E. 

With the adaptation to smooth estimating equations, this approach has an advantage 
compared to the induced smoothing approach in that the closed-form derivative matrix is 
not required, and, hence, can be applied to more general nonsmooth estimating equations. 



Zeng and Lin's (2008) Approach IZeng and Linl (120081 ) proposed to estimate the slope 
matrix by regressing the perturbed estimating functions on the perturbations. Let Zj,, b = 
1, . . . , B, be B realizations of a p-dimensional standard normal random vector. For case- 
cohort data, let be the jth component of U°. We estimate the jth row of A, j = 
l,...,p, by A n j, the least squares estimate of the regression coefficients when regressing 



n l l 2 U n j(f3 n + n 1 l 2 Z b ) on Z b , i = l,...,n. The variance estimator also has the sandwich 
form t n = A~ 1 V n (A~ 1 ) T . 

This approach differs from the induced smoothing approach in that the slope matrix A 
is estimated via a resampling procedure that involves p least squares regressions, instead of 
taking the derivatives of a smooth function. It can be viewed as an empirical version of the 
induced smoothing approach. 



4 Simulation 

We conducted an extensive simulation study to assess the performance of the our point and 
variance estimators. Failure time T was generated from AFT model 

log(T) = 2 + X l + X 2 + X 3 + e, 

where X\ was Bernoulli with rate 0.5, X 2 and X 3 were uncorrelated standard normal vari- 
ables. Censoring time C was generated from unif (0, r) where r was tuned to achieve desired 
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censoring rate C p . The distribution of e had three types: standard normal, standard logistic, 
or standard Gumbel, abbreviated by N, L, and G, respectively. The censoring rate C p had 
two levels, 90% and 97%, representing a mildly rare disease and a very rare disease, respec- 
tively For the mildly rare disease, the full cohort size was set to be 1500 and the case-cohort 
size was set torn — 300 on average. For the very rare disease, the full cohort sizes were set to 
be 1500 and 3000, each with case-cohort sizes averaged at m G {150,300}. The sub-cohort 
sampling proportion p n was set to yield the desired average case-cohort size given censoring 
rate and full cohort size. For each viable combination, we generated 1000 datasets. 

Given a dataset, point estimates of regression coefficients were obtained from both non- 
smooth and smoothed estimating equati ons. The estima tor from the nonsmooth version 



was obtained using linear programming (IJin et al.l . 120031 ). denoted by LP. The estimator 



from the induce d smoothing appro ach with estimating equations @ was obtained using R 
package nleqslv (jHasselmanl . 120121 ) . denoted by IS. The two estimators are expected to be 
asymptotically the same, but with the IS estimator obtained much faster. Eight variance 
estimates were computed for the point estimate. The first two were full multiplier bootstrap 
estimates, denoted by MB, one based on the LP approach and the other based on the IS 
approach. The rest six were sandwich estimates constructed by combinations of three ap- 
proaches to estimate A and two approaches to estimate V. We use abbreviations IS, SH, 
and ZL to denote the induced smoothing, smoothed Huang's, and Zeng and Lin's approach 
for A, respectively. We use abbreviations CF and MB to denote the closed-form estimate 
approach and the multiplier bootstrap approach for V, respectively. 

Results for the mildly rare disease case with censoring percentage C p = 90%, full cohort 
size 1500, and average case-cohort size fh = 300 are summarized in Table [TJ Both the 
LP and the IS estimators appear to be virtually unbiased. In fact, they agreed with each 
other closely on a 45 degree line (not shown). Consequently, their empirical standard errors 
agreed with each other, and their bootstrap based standard errors agreed with each other. 
The bootstrap standard errors and the empirical standard errors match closely, suggesting 
that the bootstrap variance estimators provide good estimation of the empirical variantion. 
The other six standard errors based on sandwich variance estimators agreed quite well with 
the empirical standard errors too. The associated 95% confidence intervals based on all eight 
standard errors had empirical coverage percentages reasonably close to the nominal level. 
These observations were invariant to the error distributions. 

Table [2] summarizes the results for the very rare disease case with censoring rate 97% and 
full cohort size 3000. The results for full cohort size 1500 were similar and not reported. The 
two point estimates, their empirical standard errors, and their average bootstrap standard 
errors still agree with each other. The bootstrap standard errors for case-cohort size 150, 
however, are underestimating the true variation, and as a result, the 95% confidence intervals 
had coverage percentage smaller than the nominal level. Not surprisingly, the six sandwich 
variance estimators performed no better than the two multiplier bootstrap variance esti- 
mators. When the case-cohort size was increased to 300, all variance estimators performed 
reasonably well in estimating the true variation and the coverage percentage was reasonably 
close to the nominal level. Among all the sandwich variance estimators, the IS-MB and ZL- 
MB approaches seem to provide confidence intervals with the best coverage percentage. The 
SH-MB approach is slight inferior, which might be explained by the fact that this approach 
has two layers of approximation — one from asymptotic linear approximation and the other 
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Table 1: Summary of simulation results based on 1000 replications for full cohort size 1500 and censoring rate 90%. The 
bootstrapping size is 500 for each replication. PE is average of point estimates; ESE is the empirical standard deviation of the 
parameter estimates; ASE is the average of the standard error of the estimator; CP is the coverage percentage of 95% confidence 
interval. 



Error PE ESE ASE CP(%) 

LP IS LP IS MB IS SH ZL MB IS SH ZL 

LP IS CF MB CF MB CF MB LP IS CF MB CF MB CF MB 

N ft 0.997 1.000 0.170 0.170 0.161 0.161 0.161 0.163 0.155 0.157 0.161 0.163 93.4 93.8 94.0 94.2 93.0 93.5 94.0 94.5 

2 1-004 1.009 0.090 0.090 0.089 0.089 0.089 0.090 0.086 0.087 0.089 0.090 93.9 93.8 94.0 94.3 93.0 93.5 94.0 94.1 

3 1.000 1.004 0.093 0.093 0.089 0.088 0.089 0.090 0.102 0.103 0.089 0.090 93.8 94.0 94.1 94.3 96.6 96.7 94.0 94.6 
L 0! 0.998 1.000 0.284 0.284 0.274 0.274 0.273 0.275 0.269 0.271 0.273 0.275 94.4 94.5 94.6 94.9 93.9 94.2 94.4 94.7 

2 1.008 1.011 0.150 0.150 0.149 0.149 0.148 0.149 0.148 0.149 0.148 0.149 94.9 94.9 94.6 94.6 94.1 94.8 94.7 94.9 

/3 3 1.012 1.015 0.151 0.151 0.149 0.149 0.149 0.150 0.159 0.160 0.149 0.150 94.2 94.4 94.0 94.0 95.9 96.1 93.8 94.6 

G 0i 0.999 1.003 0.148 0.148 0.142 0.143 0.143 0.145 0.137 0.139 0.143 0.145 94.7 94.9 94.4 94.5 93.1 93.2 94.4 94.7 

2 0.999 1.004 0.082 0.082 0.079 0.079 0.079 0.080 0.075 0.077 0.079 0.080 93.2 93.6 93.6 94.1 91.9 92.3 93.6 94.2 

03 1.001 1.006 0.082 0.082 0.079 0.079 0.079 0.081 0.093 0.094 0.079 0.081 93.3 92.7 93.3 93.6 97.3 97.2 93.2 94.2 



Table 2: Summary of simulation results based on 1000 replications for full cohort size 3000 and and censoring rate 97%. The 
bootstrapping size is 500 for each replication. PE is average of point estimates; ESE is the empiricalstandard deviation of the 
parameter estimates; ASE is the average of the standard error of the estimator; CP is the coverage percentage of 95% confidence 
interval. 



Error (3 PE ESE ASE CP(%) 

LP IS LP IS MB IS SH ZL MB IS SH ZL 



LP IS CF MB CF MB CF MB LP IS CF MB CF MB CF MB 
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,8 




03 


1.013 


1 


,028 


0, 


.149 


0, 


,149 


0, 


.129 


0, 


.128 


0. 


.128 





.139 


0. 


.152 


0. 


.162 


0, 


.128 


0, 


.139 


92 


.2 


91 


.5 


89, 


,7 


92. 


,8 


93 


.7 


96 


.1 


89, 


.7 


92, 


,9 


case 


cohort size 


= 300: 








































































N 


0i 


1.002 


1 


,008 


0, 


.228 


0, 


,229 


0, 


.214 


0, 


,215 


0, 


,215 





,221 


0, 


,205 


0, 


,210 





.215 





.221 


92 


,9 


93 


.8 


92. 


,8 


93. 


,8 


91 


.3 


92, 


,1 


92, 


.8 


93. 


,9 




02 


1.005 


1 


,012 





.130 


0, 


,130 





.118 


0, 


,118 


0, 


,119 





,122 


0, 


111 


0, 


,114 





.119 





.122 


92, 


,8 


93 


.6 


93. 


,3 


94. 


,6 


91 


.2 


92 


,7 


93 


.4 


94, 


,0 




03 


0.998 


1 


,005 


0, 


.129 


0, 


,129 


0, 


.118 


0, 


,118 


0, 


,118 





,122 


0. 


,144 


0. 


,148 





.118 





.122 


92, 


,1 


92, 


.6 


92, 


,6 


93. 


3 


97 


.0 


97 


,0 


93 


.0 


92. 


,9 


L 




1.048 


1 


,050 





.394 


0, 


,395 





.373 


0, 


373 


0, 


368 





,373 


0, 


362 


0, 


366 





.369 





.373 


93 


,7 


93 


.8 


93 


,2 


93. 


,5 


92, 


.7 


92, 


.9 


93 


.2 


93. 


,9 




02 


1.031 


1 


,035 


0, 


.221 


0, 


,222 


0, 


.205 


0, 


,205 


0, 


,204 





,207 


0, 


,204 


0, 


,207 





.204 





.207 


92 


,1 


92 


.9 


92, 


,6 


93. 


,0 


92 


.9 


92 


,9 


92 


,6 


92. 


,8 




03 


1.026 


1 


.029 





.210 


0, 


,210 





.202 


0, 


,202 


0, 


,202 





.204 


0, 


,219 


0, 


,221 





.202 





.204 


93 


.8 


93 


.6 


93. 


,2 


93. 


,6 


94, 


.5 


95, 


.1 


93 


.1 


93 


,3 


G 


01 


1.010 


1 


.016 





.188 


0. 


,189 





.179 


0, 


,179 


0. 


,181 


0, 


.186 


0, 


,171 


0, 


,177 





.181 





.187 


93 


,2 


92, 


.9 


93. 


,4 


94. 


,4 


91, 


.2 


92, 


.7 


93 


.2 


94, 


.3 




02 


1.007 


1 


.015 





.107 


0, 


,108 





.099 


0, 


098 


0, 


099 


0, 


.102 


0, 


092 


0, 


,095 





.099 





.102 


91, 


.9 


92, 


.6 


92. 


4 


93. 


,3 


90 


.1 


91, 


.4 


92 


.3 


93 


.0 




03 


1.009 


1 


.017 





.109 


0, 


109 





.100 


0, 


099 


0. 


,100 





.103 


0, 


,124 


0, 


,127 





.100 





.103 


92, 


.4 


92 


.0 


92, 


4 


93. 


,4 


96 


.6 


97, 


.1 


92, 


.2 


93 


.4 



Table 3: Summary of timing results in seconds with both point estimation and variance 
estimation from the simulation study. 



C v m Error PE Variance 







LP 


IS 


MB 




IS 


SH 


ZL 


LP 


IS 


CF 


MB 


CF 


MB 


CF 


MB 


Full cohort 


size = 


1500 




















90% 300 


N 


7.2 


1.6 


2007.5 


561.3 


2.9 


10.9 


2.9 


11.4 


2.1 


11.6 




L 


6.5 


1.5 


1708.0 


499.9 


2.8 


10.2 


2.9 


10.8 


2.2 


10.9 




G 


6.9 


1.6 


1899.7 


544.6 


2.8 


10.4 


2.8 


10.9 


2.0 


11.1 


Full cohort 


size = 


3000 




















97% 150 


N 


0.8 


0.6 


183.4 


150.2 


0.4 


2.8 


0.5 


3.0 


0.6 


3.0 




L 


0.7 


0.4 


143.7 


118.2 


0.3 


2.5 


0.5 


2.6 


0.6 


2.7 




G 


1.1 


0.7 


262.3 


191.6 


0.5 


3.5 


0.7 


3.7 


0.7 


3.7 


300 


N 


3.6 


0.9 


629.6 


301.7 


2.1 


5.5 


2.2 


5.8 


1.3 


5.8 




L 


3.3 


0.7 


544.9 


237.1 


2.0 


4.8 


2.1 


5.1 


1.3 


5.2 




G 


4.1 


1.2 


816.8 


367.8 


2.2 


6.5 


2.3 


6.8 


1.4 


6.9 



from induced smoothing. 

Of more interest is Table [3J which summarizes the timing results in seconds averaged from 
1000 replicates for both point estimation and variance estimation on a 2GHz linux machine. 
For point estimation with full cohort size 1500 and censoring percentage C. p = 0.90%, the IS 
approach was up to 4.5 times as fast as the LP approach (with normal error distribution). 
The multipler bootstrap variance estimation with the IS approach was up to 3.6 times as 
fast as the LP approach (again with normal error). Nevertheless, the multiplier bootstrap 
IS approach still needed about 9 minutes on average to obtain a variance estimator. All 
sandwich variance estimators are strikingly much faster, especially with the closed-form ap- 
proach: ZL-CF approach took about 2 seconds on average; the IS-CF and SH-CF approaches 
took about 3 seconds on average. For each sandwich variance estimator, the version with 
CF estimation of V is over 5 times faster than the version with MB estimation of V. Using 
the LP approach as benchmark, the IS-CF and SH-CF estimators is 695 times faster and the 
ZL-CF estimators 1003 times faster. Since the performance of all variance estimators are 
similar for this setting, the IS-CF, SH-CF and ZL-CF approaches are obviously preferred for 
this setup with a mildly rare event. 

The timing results for full cohort size 3000 and censoring percentage C p = 0.97% follow 
a similar pattern. Compare to case C p = 0.90%, time for point estimation is shorter because 
the number of cases decreases in the case C v = 0.97% even when the average case-cohort size 
were both at 300. Sandwich variance estimators with CF estimation of V is up to 8 times 
faster than with those with MB estimation of V, at the expense of slightly worse performance 
in coverage percentage. The IS-MB and ZL-MB approaches yield the most reliable variance 
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estimates but IS-MB is slightly faster than faster than ZL-MB. As the average case-cohort 
size doubles, the computing time of the sandwich variance estimates with MB estimation 
for V appear to double accordingly, in contrast to those with CF estimation for V, which 
do not necessarily double linearly. In summary, based on the performance and speed, our 
recommended variance estimator is the IS-MB estimator. 



5 National Wilm's Tumor Study 

We demonstrate the performance of our proposed methods with an applicati on to the cohort 



study conducted by the National Wilm's Tumor Study Group (NWTSG) ( ID'Angio et al. 



19891 ; iGreen et all Il998l ). Wilm's tumor is a rare kidney cancer in young children. The 
interest of the study was to assess the relationship between the tumor histology and the 
outcome, time to tumor relapse. Tumor histology can be classified into two categories, 
favorable or unfavorable, depending on the cell type. The central histological diagnosis was 
made by an individual pathologist at the central pathology center, which was believed to 
be more accurate than a local diagnosis yet more expensive to measure and required more 
efforts to obtain. Although in the full version of the data, the central histology measurement 
was available for all the cohort members, it was only available for a case-cohort sample 
in the case-cohort version. We take advantage of the full version in this example. Other 
covariates that were available for all cohort members were patient age, disease stage and 
study group. According to the staging system employed by NWTSG, four stages (I - IV) of 
Wilms' tumors, with Stage IV as the latest stage, indicated the spread of the tumor. Each 
subject came from one of the two study groups, NWT SG-3 and NWTSG-4 . The case- cohort 



versio n of the data was analyzed with Cox models (IBreslow et all 120091 ; iKulich and Lin 



20041 ) and additive hazards models ( IKulich and Linl . 120001 ). respectively. 



There were a total of 4028 subjects in the full cohort. Among them, 571 were cases who 
experienced the relapse of tumor — a censoring rate of about 86%. We considered an AFT 
model for the time to relapse with the following covariates: central histology measurement 
(1 = favorable, = unfavorable), age (measure in year) at diagnosis, three tumor stages 
indicators (Stage I as reference) and a study group indicator (NWTSG-3 as reference). The 
case-cohort version of the data had 668 patients selected as sub-cohort sample and the total 
case-cohort sample size was 1154. To take advantage of availability of full cohort data, we 
drew 1000 new sub-cohort samples with size 668 and formed a case-cohort by including the 
remaining cases for each replicate. We then averaged these estimates and estimated standard 
errors from the 1000 replicates of the case-cohort analysis. 

The results of the average from 1000 replicates of case-cohort analyses are summarized in 
Table HJ Due to its poor timing performance, the LP approach was not considered. Since the 
MB standard error is considered to reflect the true variation quite well from the simulation 
study, we are interested in how close the various sandwich standard errors to the MB stan- 
dard error. For all three sandwich estimators, the CF versions systematically underestimate 
noticeably, although the underestimation is less severe in the IS and ZL approach than in 
the SH approach. The MB versions of the sandwich estimates appear to agree with the 
MB standard error closely, and again the agreement appears to be better for the IS and ZL 
approach than for the SH approach. In particular, the SH standard error for the age effect is 
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Table 4: National Wilm's tumor study and timing results in seconds. 



PE SE 
Effects IS MB IS SH ZL 

IS CF MB CF MB CF MB 



Case-Cohort Analysis: 



(time) 


(8.5) 


(3682.2) 


(7.7) 


(13.9) 


(9.5) 


(15.7) 


(9.4) 


(15.0) 


histol 


-3.428 


0.465 


0.409 


0.458 


0.372 


0.423 


0.410 


0.458 


age 


-0.190 


0.079 


0.074 


0.080 


0.221 


0.243 


0.074 


0.080 


stage2 


-1.283 


0.613 


0.590 


0.621 


0.516 


0.544 


0.590 


0.622 


stage3 


-1.401 


0.612 


0.579 


0.616 


0.572 


0.602 


0.580 


0.616 


stage4 


-2.092 


0.717 


0.665 


0.712 


0.717 


0.763 


0.666 


0.712 


study 


-0.128 


0.475 


0.455 


0.484 


0.451 


0.482 


0.455 


0.484 


Full-Cohort Analysis: 
















(time) 


(266.0) ( 


126927.7) 


(309.9) 


(453.0) 


(341.3) 


(486.1) 


(321.1) 


(494.0) 


histol 


-2.749 


0.202 


0.148 


0.213 


0.138 


0.214 


0.148 


0.196 


age 


-0.127 


0.037 


0.029 


0.039 


0.081 


0.092 


0.029 


0.039 


stage2 


-1.335 


0.280 


0.233 


0.285 


0.200 


0.280 


0.234 


0.271 


stage3 


-1.341 


0.286 


0.239 


0.297 


0.211 


0.288 


0.240 


0.299 


stage4 


-2.203 


0.319 


0.245 


0.321 


0.219 


0.300 


0.247 


0.334 


study 


-0.106 


0.226 


0.175 


0.229 


0.162 


0.224 


0.176 


0.219 
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about three times as much as that from other approaches. The standard errors from IS-MB 
and ZL-MB are almost identical, both very close to the time consuming MB standard error. 
Based on the IS-MB standard errors, the coefficients of central histological diagnosis, age, 
and all three stage indicators were found to be significantly different from zero with p-values 
0.000, 0.009, 0.019, 0.011 and 0.002, respectively. No significant difference was found be- 
tween the two study groups. In terms of timing, the MB-IS standard error took over a hour 
whereas the MB-based sandwich estimates only took 14-15 seconds on average. 

For comparison purpose, we also analyzed the full cohort data with the same approaches 
and reported the results in Table HI Point estimates are close to these in case-cohort anal- 
ysis, with their standard errors taken into consideration. All the standard errors decrease 
compared to the case-cohort analyses, which is expected as full information became avail- 
able for all covariates. The best sandwich variance estimators are still IS-MB and ZL-MB, 
both closely approximates the full blown MB standard error. With full cohort size 4028 
and censoring rate 86%, the IS point estimates took 4 and a half minutes, the MB variance 
estimation took 35.36 hours, while IS-MB only took 7 and a half minutes. 



6 Discussion 

In AFT modeling of case-cohort data, both point estimation and variance estimation are 
challenging with the nonsmooth estimating equations. Resampling methods are commonly 
used to estimate the variance, which are time consuming even with a computationally effi- 
cient point estimator such as our induced smoothing approach with rank-based estimating 
equations. We have proposed six sandwich variance estimators and compared their per- 
formances with the bootstrap variance estimator in numerical studies. The IS-MB and 
ZL-MB approaches were found to provide good approximation to the true variation and are 
computationally ve ry efficient. All the methods are implemented in an R package aft gee 



( iChiou et all 120121 ). The package had the potential to bring AFT modeling of case-cohort 
data into routine analysis. 

The IS approach was built on Gehan's weight for rank-based estimating equations, in 
which case closed-form expectations of the perturbed estimating equations are available. 
Alternative weights such as the logrank weight are possible, though the computation is less 
straightforward than that for Gehan's weight. Incorporating a general, possibly optimized 
weight in the IS approach merits further investigation for both full cohort and case-cohort 
data. The estimates from Gehan's weight always serve as a good initial value in numerical 
equations solving. 

Some extensions of the proposed methods are worth considering. Auxiliary covariates 
are often available for the entire cohort, which can be used to construct strata for subcohort 
members selection. The resulting estimators from the stratified case-cohort design has been 
shown to be more effic ient than their traditional case-c ohort counterpart for the Cox model 



( IKulich and Linl . l2004j ) and the additive hazards model (IKulich and Linl . l2000[ ) . An extension 
of the proposed methods with the AFT model to a stratified case-cohort design may lead 
to efficiency improvement too. When more than one diseases are considered in a case- 
cohort design, a multivariate extension will be needed, and a possible depende nce among the 



multivariate failure times needs to be taken into account. For the Cox model, iKang and Cai 
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(120091 ) used a marginal approach. A similar approach can be considered for the AFT model. 



A Analytical Details 

We give the a nalytical f orm o f Si(/3)'s here. Define the general rank based weighted estimat- 
ing function (IJin et all 120031 ) 



8=1 

where ip n ,i(0) is an nonnegative weight function and 

1 n 

W<$tf) = - ^Xftetf) > emi k = 0, 1. 



Equation (pQ) can be obtained by setting <^ n> j(/3) = W^(/3). On the other hand, the general 
rank based weighted estimating function for case-cohort samples has the following form: 



8=1 



X- 



where 



1 n 

fr2$(P) = -Y, h i x i I foW* e * 



k = 0,1. 



Similarly, equation fl2]) can be obtained by setting ip nti ((3) = W^(j3). 
With these settings, an explicit form of Si((3 ) is 



Xi- 



=A^(°)(/3 



where 



X, 



w {k) {(3) = lim VO/3), for fc = 0, 1 



w (1) (A)) 



dMi{t) 

ei(/3) 



^ (0) (/3o) 



w (1) (A)) 



A(t) dt, 



Mi(*) = Nitf; t) - / J( ei (/3) > u)X(u) du, 
Jo 

Ni(P;t) = AjJ(ej(/3) < t) and X(u) is the common hazard function of e,. 

The unknown quantities in S;(/3o) include /?o, w^°\ and A(£). With the explicit form 
of Si(po), Si(f3) is obtained by replacing these unknown quantities by their sample estimators. 
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